This report explores in depth a dataset of 4898 White Wines.
Load the Packages
Univariate Plots Section
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
This dataset includes 11 input variables and 1 output variable (Quality),
with 4898 observations of White Wines.
I have decided to remove the unneeded variable X because it serves no real
purpose; judging by the summary above, it is just a row index from 1 to 4898.
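A minimal sketch of that step, assuming the data frame is called WW (the name
that shows up in the grouped summaries later in this report). Only the column
names come from the output above; the one-row data frame is a placeholder with
dummy values, not real data.

```r
# Column names exactly as printed by names() above.
cols <- c("X", "fixed.acidity", "volatile.acidity", "citric.acid",
          "residual.sugar", "chlorides", "free.sulfur.dioxide",
          "total.sulfur.dioxide", "density", "pH", "sulphates",
          "alcohol", "quality")

# Placeholder one-row data frame with those columns (values are dummies).
WW <- as.data.frame(matrix(0, nrow = 1, ncol = length(cols),
                           dimnames = list(NULL, cols)))

# Drop the row-index column X.
WW$X <- NULL

# What remains: the input variables plus the single output, quality.
inputs <- setdiff(names(WW), "quality")
length(inputs)
```

Counting the remaining columns this way is a quick check that the 11 inputs
and 1 output described above are all that is left once X is gone.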
I decided to do a quick plot of every variable to better understand the data
and to see its distribution. These plots look roughly normal. There appear to
be numerous outliers at the high end of fixed acidity, volatile acidity, and
sulphates. Density appears to be the only plot with a limited number of
outliers.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The plots above also show mostly normal distributions. Chlorides seem to be
the only variable with a decent number of outliers. pH, free sulfur dioxide,
and total sulfur dioxide also have a couple of outliers.
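To make "outliers" concrete, here is a small sketch that applies the usual
1.5 x IQR boxplot rule to the chlorides quartiles printed above. Using that
rule is an assumption on my part; the outliers in the plots may simply have
been judged by eye.

```r
# Quartiles of chlorides, copied from the summary output above.
q1 <- 0.036
q3 <- 0.050

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are flagged.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)

# The printed maximum (0.346) lies far above the upper fence.
chlorides_max <- 0.346
chlorides_max > upper_fence
```

With these quartiles the fences work out to 0.015 and 0.071, so the maximum of
0.346 is several times the upper fence, which fits the long upper tail seen
for chlorides.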

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The residual sugar plot is right-skewed, with most wines bunched at the low
end (perhaps because the sample contains fewer sweet White Wines?), and the
alcohol plot is pretty spread out.
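The residual sugar summary printed above already hints at the shape of that
distribution. A rough sketch using only those printed numbers: the mean sits
above the median, and the distance from the median to the maximum dwarfs the
distance to the minimum. Taking log10 is shown here as one common way to
compress such a tail, not necessarily the exact transformation behind the
plots.

```r
# Residual sugar summary values, copied from the output above.
rs <- c(min = 0.6, q1 = 1.7, median = 5.2, mean = 6.391, q3 = 9.9, max = 65.8)

# Mean above median suggests a long tail toward high values.
rs[["mean"]] > rs[["median"]]

# Distance from the median to each extreme on the raw scale...
raw_tail <- c(left  = rs[["median"]] - rs[["min"]],
              right = rs[["max"]] - rs[["median"]])

# ...and on the log10 scale, where the upper tail is compressed.
log_rs <- log10(rs)
log_tail <- c(left  = log_rs[["median"]] - log_rs[["min"]],
              right = log_rs[["max"]] - log_rs[["median"]])

raw_tail
log_tail
```

On the raw scale the upper tail is more than ten times longer than the lower
one; on the log10 scale the two are of similar length.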

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
After a log transformation, residual sugar shows a bimodal distribution.
White Wines are considered sweet if they have a residual sugar content
greater than 45, and it appears there are few sweet White Wines.
The citric acid distribution has an interesting spike around 0.5. After
transforming the data with log10, it appears to be mostly normal.
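One caveat with log10 here, offered as a side note rather than something the
plots are known to have handled: the summary at the top shows a minimum
citric.acid of 0.0000, and log10(0) is -Inf. The values below are made up for
illustration and are not rows from the data.

```r
# Illustrative citric acid values including a zero (not real rows).
citric <- c(0, 0.27, 0.32, 0.39, 0.49, 1.66)

# log10(0) is -Inf, so zero values cannot be placed on a log axis.
log_citric <- log10(citric)
is.finite(log_citric)

# One simple option is to look only at the strictly positive values.
positive <- citric[citric > 0]
summary(log10(positive))
```

Any wine with zero citric acid would drop out of a log-scaled histogram, so
the "mostly normal" shape describes the wines with positive values.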
Univariate Analysis
What is the structure of your dataset?
The dataset is made up of 4898 observations of White Wines with 11 inputs
(fixed acidity, volatile acidity, citric acid, residual sugar, chlorides,
free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol)
and 1 output (quality).
What are the main features of interest in your dataset?
I think the main features of this White Wine dataset are Alcohol (%) and
Residual Sugar. They are the 2 main variables that do not appear to have a
normal distribution.
What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?
Chlorides, Volatile Acidity, and total sulfur dioxide seem to play a smaller
part in the quality of the White Wine. Citric acid also has an interesting
spike around .5.
Did you create any new variables from existing variables in the dataset?
I created a new variable called quality_fac, a factor version of quality, to
group the wines by quality in some of my plots and to give better
visualizations of the data in the Multivariate Section below.
Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the
Bivariate Plots

I used GGpairs to look for any apparent correlations in the data. There
appear to be correlations between a number of variables in the dataset worth
exploring, including density/residual sugar and alcohol/quality, among
others.

This plot also gives a good overview of the correlations between all the
variables. It shows the strong relationships between residual sugar/density
and alcohol/density, as well as many others to explore.
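The same kind of overview can be pulled out numerically. A small sketch with
base R's cor() that finds the most strongly correlated pair of columns. The
data here is simulated stand-in data, not the wine data; the column names
mirror the report but the values and the way density is generated are
assumptions made only so the example runs on its own.

```r
set.seed(1)

# Simulated stand-in data; column names mirror the report, values are made up.
n <- 200
sugar   <- runif(n, 1, 20)
alcohol <- runif(n, 8, 14)
toy <- data.frame(
  residual.sugar = sugar,
  alcohol        = alcohol,
  # Density built to rise with sugar and fall with alcohol, plus noise.
  density        = 0.99 + 0.0004 * sugar - 0.001 * alcohol +
                   rnorm(n, sd = 0.0005)
)

# Pearson correlation matrix of all columns.
cm <- cor(toy)

# Blank out the diagonal and the duplicate lower triangle, then find the
# pair with the largest absolute correlation.
cm[lower.tri(cm, diag = TRUE)] <- NA
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
strongest
```

Run against the real data frame instead of toy, the same steps would list the
pair that stands out most in the plot matrix.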

## [1] 0.09942725

## [1] 0.4355747

## [1] -0.009209091

## [1] -0.09757683
I am comparing quality to several other variables to get a sense of what
goes into making a high quality White Wine.
The best quality White Wines seem to have a pH of 3.0 to 3.5, an alcohol
content between 10 and 13, medium to high levels of citric acid (0.25-0.5),
and low residual sugar (0-18). I was pretty surprised to see high alcohol
content as a pretty good indicator of high quality white wines. It also
appears that wines with lower residual sugar are more appealing.
## WW$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.34 11.00 12.60
## --------------------------------------------------------
## WW$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## WW$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## WW$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## WW$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## WW$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## WW$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
## WW$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.588 4.600 6.392 10.700 16.200
## --------------------------------------------------------
## WW$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## WW$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## WW$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## WW$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## WW$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## WW$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
Comparing the summaries, alcohol content shows a trend toward higher levels
as the quality gets better, but no such trend is apparent for residual
sugar.
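One rough way to put a number on those two impressions is to rank-correlate
the quality score with the group medians printed above. This is only a sketch:
it uses seven medians from groups of very different sizes, so it says nothing
about individual wines.

```r
# Quality scores and the group medians printed in the summaries above.
quality     <- 3:9
alcohol_med <- c(10.45, 10.10, 9.50, 10.50, 11.40, 12.00, 12.50)
sugar_med   <- c(4.60, 2.50, 7.00, 5.30, 3.65, 4.30, 2.20)

# Spearman rank correlation between quality and each set of medians.
rho_alcohol <- cor(quality, alcohol_med, method = "spearman")
rho_sugar   <- cor(quality, sugar_med, method = "spearman")

round(c(alcohol = rho_alcohol, residual.sugar = rho_sugar), 2)
```

The alcohol medians rise almost monotonically with quality (rho of about
0.86), while the residual sugar medians bounce around (rho of about -0.43).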

## [1] -0.3071233

## [1] -0.2099344

## [1] 0.05367788

## [1] -0.1136628
They also seem to have a lower density, lower chlorides (0.02-0.06), a
somewhat lower amount of sulphates, and fixed acidity between 4 and 8. It
appears lower density is related to quality, as are lower chlorides. I
figured the opposite would be true with the chlorides and possibly density.
## WW$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
## --------------------------------------------------------
## WW$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
## --------------------------------------------------------
## WW$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
## --------------------------------------------------------
## WW$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## WW$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
## --------------------------------------------------------
## WW$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
## --------------------------------------------------------
## WW$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
## WW$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## WW$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## WW$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## WW$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## WW$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## WW$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## WW$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
It appears the higher quality wines do have lower density and chlorides.
Apparently the less sweet and less salty wines are of better quality, as
shown by the summary comparisons.
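Grouped summaries like the ones above have the shape of base R's by() output.
A minimal sketch, assuming a data frame named WW with quality and chlorides
columns (both names appear in the output); the six rows below are invented for
illustration.

```r
# Tiny invented data frame; column names follow the report.
WW <- data.frame(
  quality   = c(5, 5, 6, 6, 7, 7),
  chlorides = c(0.050, 0.046, 0.044, 0.042, 0.038, 0.036)
)

# Full summary of chlorides within each quality level.
by(WW$chlorides, WW$quality, summary)

# Just the medians, which are easier to compare across levels.
med <- tapply(WW$chlorides, WW$quality, median)
med
```

Pulling out only the medians with tapply() makes a downward or upward trend
across quality levels easier to see than scanning the full summaries.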

## [1] 0.0890207

## [1] 0.5298813

## [1] -0.7801376

## [1] 0.8389665
There is definitely a linear relationship between density and total sulfur
dioxide (.53), as seen in the plot above, as well as a negative linear
relationship between alcohol and density (-.78). Density and residual sugar
also appear to have a very strong linear relationship (.839). There doesn’t
appear to be any relationship between fixed acidity and residual sugar
(.089). I wasn’t surprised by any of the findings except the one for density
and total sulfur dioxide; I thought density was more tied to the
sugar/alcohol content.
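Squaring a correlation coefficient gives the share of variance a straight-line
fit on that one variable would explain, which makes the three relationships
easier to compare. A short sketch using the coefficients printed above; the
vector names are just labels chosen here.

```r
# Correlation coefficients printed above.
r <- c(density_total.sulfur.dioxide = 0.5298813,
       alcohol_density              = -0.7801376,
       density_residual.sugar       = 0.8389665)

# Squaring r gives the share of variance a straight-line fit would explain.
r_squared <- r^2
round(r_squared, 3)
```

Taken one at a time, residual sugar accounts for roughly 70% of the variance
in density, alcohol for about 61%, and total sulfur dioxide for about 28%.
These shares overlap and cannot simply be added together.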
Bivariate Analysis
Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?
The GGpairs plot was pretty interesting because it put everything together
and showed correlations between the variables. One of the biggest was the
correlation between density and residual sugars as well as density and
alcohol.
The higher quality White Wines tend to have a medium pH between 3 and 3.5, a
higher alcohol content, a mid-high citric acid level, lower residual sugar,
lower chlorides, lower density, and somewhat lower fixed acidity.
Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?
Density seems to be closely related to residual sugar and fixed acidity.
Alcohol also seems to be related to density and total sulfur dioxide, and it
appears that the higher the alcohol content, the less dense the White Wine is.
What was the strongest relationship you found?
The strongest relationship appears to be between Density and Residual Sugar
(.839) because as sugar builds in the wines so does the density.
Multivariate Plots Section
Adding factor to quality for ranking purposes.
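A minimal sketch of what such a step could look like. The name quality_fac is
the one used in this report; making the factor ordered is an assumption, and
the scores below are invented.

```r
# Invented quality scores for illustration.
WW <- data.frame(quality = c(6, 5, 7, 3, 9, 6, 8, 4))

# Ordered factor so that levels sort from lowest to highest score.
WW$quality_fac <- factor(WW$quality, ordered = TRUE)

levels(WW$quality_fac)

# Ordered factors can be compared like ranks.
WW$quality_fac[1] > WW$quality_fac[2]
```

Treating quality as a factor lets plotting functions give each score its own
color or box instead of a continuous gradient.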

I wanted to explore the density and alcohol relationship further because of
how strongly both relate to the quality of wines. It is pretty apparent that
quality goes up with higher alcohol levels and lower density levels, and
these two variables also directly affect each other.
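A plot of this kind can be sketched with ggplot2 as follows. The data is
simulated stand-in data (not the wine data), the color palette is an arbitrary
choice, and the plot is only drawn if ggplot2 is installed; none of this is
meant as the exact code behind the figure.

```r
set.seed(42)

# Simulated stand-in data (not the wine data); names follow the report.
n <- 300
WW <- data.frame(alcohol = runif(n, 8, 14))
WW$density <- 1.005 - 0.001 * WW$alcohol + rnorm(n, sd = 0.001)
WW$quality_fac <- factor(sample(3:9, n, replace = TRUE), ordered = TRUE)

# Draw the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(WW, aes(x = alcohol, y = density, color = quality_fac)) +
    geom_point(alpha = 0.5) +
    scale_color_brewer(type = "seq") +
    labs(x = "Alcohol (%)", y = "Density", color = "Quality")
  print(p)
}
```

Mapping the quality factor to color puts three variables on one scatterplot,
so the alcohol/density slope and the spread of quality can be read together.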


Neither plot shows any real relationship between sulphates and total or free
sulfur dioxide, or any clear effect of the two on the quality of White Wine.
I was curious because sulphates tend to contribute to sulfur dioxide levels
according to the description of attributes.

This plot matrix brings all the combinations together in one easy to
view plot. Quality White Wines have an above average level of citric acid,
lower level of chlorides, higher level of alcohol, and medium to high level
of fixed acidity compared to the lower quality White Wines.

This again shows the Quality of White Wines, but this time with Density and
Alcohol switched around, showing the strong linear relationship as well as
the quality factor.


This is a better look at the different quality White Wines compared to
Density and Alcohol together and then separately in order to see the
difference. I find it fascinating that alcohol and density are so closely
related.
Density seems to be closely tied to the alcohol content as well as possibly
total sulfur dioxide. The higher the alcohol content the less dense the
White Wines appear to be.
Multivariate Analysis
Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?
I was surprised to find that the higher quality White Wines seem to have a
higher alcohol content, which in turn means a lower density. I thought that
the opposite would be true due to the taste of alcohol.
I was also surprised to find that the higher quality White Wines had a medium
to high level of citric acid as well as low levels of chlorides (salt).
Were there any interesting or surprising interactions between features?
I thought it was definitely interesting that as the alcohol content goes
up the density goes down.
OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.
I did not create a model.
Final Plots and Summary
Plot One

The Density and Residual Sugar of the White Wines have a strong linear
relationship, as shown in the above plot.
Plot Two

The higher quality White Wines have a higher alcohol content and lower
density than the lower quality White Wines. It also shows a strong linear
relationship between density and alcohol.
Plot Three


Both of these boxplots seem to back up my findings that higher citric
acid, higher alcohol content, and lower chlorides make a better quality
White Wine.
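The numbers behind a boxplot can also be read off directly, which helps when
comparing groups. A small sketch with base R's boxplot() and invented alcohol
values for three quality levels; the column names follow the report but the
rows are not from the data.

```r
# Invented alcohol values for three quality levels (illustration only).
WW <- data.frame(
  quality = rep(c(5, 6, 7), each = 5),
  alcohol = c(9.0, 9.2, 9.5, 10.0, 10.3,
              9.6, 10.0, 10.5, 11.0, 11.4,
              10.6, 11.0, 11.4, 12.0, 12.3)
)

# plot = FALSE returns the numbers a boxplot would draw instead of plotting.
bp <- boxplot(alcohol ~ quality, data = WW, plot = FALSE)

# Row 3 of bp$stats holds the median of each group.
medians <- setNames(bp$stats[3, ], bp$names)
medians
```

Comparing the per-group medians (and the hinges in rows 2 and 4 of bp$stats)
is a quick way to check whether the boxes really shift with quality.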
Reflection
This dataset contained 4898 observations of White Wine Quality with 11
inputs and 1 output. After exploring the data in detail I can say for certain
I know a lot more about Wine than I have ever known. At first I was
concentrating strictly on what variables are needed to make a high quality
wine; then after more research I started wondering how the variables related
to one another. I was very surprised to find that density is closely related
to alcohol content as well as residual sugars. The more dense the wine was
the less alcohol content it contained.
After examining the sulphates and sulfur dioxide I was very surprised to
learn they are not closely correlated, as the description of attributes
mentions that sulphates can contribute to sulfur dioxide gas levels. It
appears that density and total sulfur dioxide also have a linear relationship
that could be further examined.
I had some trouble bringing together the multiple variables without including
quality, but as I moved away from the only output variable (quality) it
was apparent that there were some strong linear relationships between the
other variables. I was also stumped as to why density, residual sugar, and
alcohol are so closely related to each other until I started doing some
research and saw the linear relationships between the three.
I think there is opportunity for further understanding of the makeup of
a good quality wine with more data on a wider range of white wines. Breaking
up the data into the 7 classes of whites would also allow you to gain more
understanding of how the different variables that make up the wines react and
come together to form a high quality wine.